Add datasets and documentation #1

winfried-ripken · 2022-02-03T11:26:53Z

Using Read the Docs: set up the structure from their Sphinx template
Copy PR template and issue templates from mxlabs-dataset
Adding datasets + documentation using this template created by Thomas
- Kaggle Casting Quality: prototype dataset card
- ImageNet: empty documentation

TODO

Replace squirrel references with squirrel-core once available
Document preprocessing with Spark

winfried-ripken · 2022-02-03T19:00:01Z

Would be happy to get your feedback on this prototype dataset card

TiansuYu

Thanks, looks really good to me.

docs/source/conf.py

src/squirrel_datasets_core/datasets/imagenet/README.rst

src/squirrel_datasets_core/datasets/kaggle_casting_quality/README.rst

AlpAribal

Looking good already! Dataset card is quite nice 👍

Depending on when this PR gets ready to merge, you may want to refrain from adding the driver .py files, since there will be a lot of changes with the new squirrel store and driver APIs.

.gitignore

.readthedocs.yaml

docs/source/conf.py

requirements.in

src/squirrel_datasets_core/datasets/kaggle_casting_quality/README.rst

ThomasWollmann · 2022-02-10T10:23:35Z

@winfried-loetzsch Very cool! Would be nice to have the templates also in the devtool, so that other projects get this when rendering the skeleton.

.gitignore

ThomasWollmann · 2022-02-10T10:31:14Z

There should also be tests.

docs/source/index.rst

winfried-ripken · 2022-02-21T15:21:39Z

Description

Create main structure of the repo
Add drivers for Camvid, ds_bowl2018, imagenet, kaggle_casting_quality, monthly_german_tweets as well as hub, huggingface, torchvision drivers
Add prototype dataset card + reference it in the Sphinx documentation
Parse dataset cards to extract machine-readable information

from squirrel_datasets_core.datasets.kaggle_casting_quality import DATASET_ATTRIBUTES
print(DATASET_ATTRIBUTES)

will give

{
  'pretty_name': 'Kaggle Casting Quality', 
  'languages': '[]', 
  'licenses': 'CC BY-NC-ND 4.0', 
  'size_categories': '1K<n<10K', 
  'task_categories': 'image-classification'
}
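
As a rough illustration of the dataset-card parsing described above, the sketch below pulls `:field: value` pairs out of an RST-style card and returns them as a dict of strings. The card layout, the `parse_dataset_card` helper and the sample text are assumptions made for illustration; the parser added in this PR may work differently.

```python
import re
from typing import Dict

# Matches RST field-list entries such as ":pretty_name: Kaggle Casting Quality".
FIELD_PATTERN = re.compile(r"^:(?P<key>[A-Za-z_]+):\s*(?P<value>.*)$")


def parse_dataset_card(text: str) -> Dict[str, str]:
    """Collect field-list entries from a dataset card into a dict of strings."""
    attributes: Dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match:
            attributes[match.group("key")] = match.group("value").strip()
    return attributes


# Hypothetical card excerpt; the values mirror the output shown in the description.
SAMPLE_CARD = """\
Kaggle Casting Quality
======================

:pretty_name: Kaggle Casting Quality
:languages: []
:licenses: CC BY-NC-ND 4.0
:size_categories: 1K<n<10K
:task_categories: image-classification
"""

DATASET_ATTRIBUTES = parse_dataset_card(SAMPLE_CARD)
print(DATASET_ATTRIBUTES)
```

Keeping every value as a plain string (including `'[]'` for `languages`) matches the output above and leaves any further interpretation to the caller.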

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring including code style reformatting
Other (please describe):

Checklist:

I have read the [contributing guideline doc] () (external only)
I have signed the [CLA] () (external only)
Lint and unit tests pass locally with my changes
I have kept the PR small so that it can be easily reviewed
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
All dependency changes have been reflected in the pip requirement files.

TiansuYu

So far everything LGTM. Thanks a lot for the work!
Also interesting to see that all the 'data cards' are in each dataset folder instead of in docs. Is there any reason to prefer this way?

src/squirrel_datasets_core/datasets/camvid/driver.py

TiansuYu · 2022-02-24T07:48:33Z

src/squirrel_datasets_core/squirrel_plugin.py

+        except AttributeError:
+            pass
+
+    return drivers


We will add squirrel_sources here in the future, right?

We need to exclude the sources if they point to our internal GCP buckets, but I think we could include some of the Hugging Face sources, for example. Let's discuss this on Monday :)

Sounds good.
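
For readers without the full diff, the following is a minimal sketch of the pattern the snippet above suggests: gather drivers from several modules and silently skip modules that do not export any. Only the `except AttributeError: pass` and `return drivers` lines come from the diff; the `collect_drivers` function, the `DRIVERS` attribute name and the stand-in modules are hypothetical.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def collect_drivers(modules: Iterable[object]) -> List[Tuple[str, type]]:
    """Gather (name, driver class) pairs from modules that define DRIVERS."""
    drivers: List[Tuple[str, type]] = []
    for module in modules:
        try:
            # Hypothetical convention: each dataset module lists its drivers.
            drivers.extend(module.DRIVERS)
        except AttributeError:
            # The module exports no drivers, so it is skipped.
            pass

    return drivers


class CamvidDriver:
    """Placeholder standing in for a real driver class."""


# Stand-ins for imported dataset modules, one with and one without drivers.
with_drivers = SimpleNamespace(DRIVERS=[("camvid", CamvidDriver)])
without_drivers = SimpleNamespace()

drivers = collect_drivers([with_drivers, without_drivers])
```

If sources are registered later, as discussed in this thread, a second function of the same shape could collect them, with the internal-bucket sources filtered out before returning.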

AlpAribal

LGTM, left minor remarks.

AlpAribal · 2022-02-24T09:44:10Z

docs/source/conf.py

+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3/", None),
+    "sphinx": ("https://www.sphinx-doc.org/en/master/", None),
+}


Maybe for the future: we can add https://squirrel.readthedocs.io here

I didn't know we have already created the page. Looks nice!
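
A sketch of what the suggested addition to `docs/source/conf.py` could look like. The `"squirrel"` key name and the exact base URL are assumptions; intersphinx needs the URL under which the project's `objects.inv` is served, which may include a language and version path.

```python
# Sketch of an extended intersphinx mapping for docs/source/conf.py.
intersphinx_mapping = {
    "python": ("https://docs.python.org/3/", None),
    "sphinx": ("https://www.sphinx-doc.org/en/master/", None),
    # Assumed entry: key name and base URL are not confirmed by the diff.
    "squirrel": ("https://squirrel.readthedocs.io", None),
}
```

With such an entry in place, cross-references to objects documented in the squirrel docs could resolve from this project's pages.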

AlpAribal · 2022-02-24T09:46:25Z

requirements.in

+torchvision
+scipy
+pyspark==3.2.0
+hydra-core>=1.1.0
+hub


In squirrel-datasets, we have different flavors for e.g. torchvision and hub. Do we have them all in a single reqs.in here on purpose?

OK, let's discuss in the group whether it makes sense to have the flavors or not :)

AlpAribal · 2022-02-24T09:47:27Z

setup.py

+
+# generate extras based on requirements files
+extras_require = dict()
+for a_extra in ["dev", "preprocessing", "torchvision", "hub"]:


We can remove some of these extras if we don't want flavors.
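
To make the flavor discussion concrete, here is a minimal sketch of generating `extras_require` from per-flavor requirements files. The `requirements.<extra>.in` naming scheme and both helper functions are assumptions for illustration; the actual `setup.py` in this PR may read its files differently.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def load_requirements(path: Path) -> List[str]:
    """Return requirement specifiers from a file, skipping blanks and comments."""
    stripped = (line.strip() for line in path.read_text().splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


def build_extras(root: Path, extras: List[str]) -> Dict[str, List[str]]:
    """Map each extra to its requirements, ignoring extras without a file."""
    extras_require: Dict[str, List[str]] = {}
    for a_extra in extras:
        # Hypothetical naming scheme: one requirements file per flavor.
        req_file = root / f"requirements.{a_extra}.in"
        if req_file.exists():
            extras_require[a_extra] = load_requirements(req_file)
    return extras_require


# Demonstration in a temporary directory with two flavor files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "requirements.torchvision.in").write_text("# vision extras\ntorchvision\n")
    (root / "requirements.hub.in").write_text("hub\n")
    extras_require = build_extras(root, ["dev", "preprocessing", "torchvision", "hub"])
```

Dropping a flavor then amounts to removing its name from the list (and its file), which keeps the decision about flavors independent of the rest of the setup code.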

winfried-ripken · 2022-02-24T16:57:46Z

So far everything LGTM. Thanks a lot for the work! Also interesting to see that all the 'data cards' are in each dataset folder instead of in docs. Is there any reason to prefer this way?

This was brought up by @ThomasWollmann, mainly to simplify the process of adding a dataset (everything is in one place) and to make it possible to see the README/dataset card on GitHub while browsing through the dataset implementations.

Winfried Loetzsch added 4 commits February 3, 2022 11:50

apply readthedocs template

832a85d

PR and issue templates

3eb9697

copy imagenet+kaggle drivers+dependencies

1269e27

adding dataset cards

d4588f9

edit data card

7a12841

winfried-ripken requested review from AlpAribal and TiansuYu February 3, 2022 19:00

TiansuYu reviewed Feb 4, 2022

Winfried Loetzsch added 3 commits February 7, 2022 16:09

remove docu versioning

d6126e4

dataset card - clean up

bcaa761

add relevant public datasets. Standardize naming

b19c180

winfried-ripken requested a review from ThomasWollmann February 7, 2022 16:18

AlpAribal reviewed Feb 7, 2022

ThomasWollmann reviewed Feb 10, 2022

ThomasWollmann reviewed Feb 10, 2022

Winfried Loetzsch added 7 commits February 21, 2022 12:06

update with changes from squirrel-datasets

a0439ba

using our own gitignore

3d7c99c

First version of setup.py. Require Python 3.8

7f9dbe3

Add requirements

47f82bd

Remove preprocessing logic. Will add separate PR

f8c6866

Add test for Hub driver

a10d35f

Extend documentation by Contribute section

3ee958a

Parse attributes in dataset cards + Lint

d1e40dd

winfried-ripken marked this pull request as ready for review February 21, 2022 15:21

winfried-ripken requested review from AlpAribal and TiansuYu February 21, 2022 15:29

TiansuYu approved these changes Feb 24, 2022

Change constant to upper case

066f460

AlpAribal approved these changes Feb 24, 2022

Update Readme.rst

62183bd

Add intersphinx reference to squirrel docs

cdba8bb

winfried-ripken merged commit f5a3d25 into main Feb 24, 2022

winfried-ripken deleted the wl-add-datasets-and-documentation branch February 24, 2022 17:10
